Abstract. Directed and undirected graphical models, also called Bayesian networks and
Markov random fields, respectively, are important statistical tools in a wide variety of fields,
ranging from computational biology to probabilistic artificial intelligence. We give an upper
bound on the number of inference functions of any graphical model. This bound is
polynomial in the size of the model, for a fixed number of parameters, thus improving the
exponential upper bound given by Pachter and Sturmfels [14]. We also show that our bound
is tight up to a constant factor, by constructing a family of hidden Markov models whose
number of inference functions agrees asymptotically with the upper bound. Finally, we apply
this bound to a model for sequence alignment that is used in computational biology.
1. Introduction

Many statistical models seek, given a set of observed data, to find the hidden (unobserved) 
data which best explains these observations. In this paper we consider graphical models 
(both directed and undirected), a broad class that includes many useful models, such as
hidden Markov models (HMMs), pairwise-hidden Markov models, hidden tree models, Markov
random fields, and some language models (background on graphical models will be given in
Section 2.1). These graphical models relate the hidden and observed data probabilistically,
and a natural problem is to determine, given a particular observation, what is the most likely 
hidden data (which is called the explanation). These models rely on parameters that are 
the probabilities relating the hidden and observed data. Any fixed values of the parameters 
determine a way to assign an explanation to each possible observation. This gives us a map, 
called an inference function, from observations to explanations. 

An example of an inference function is the popular "Did you mean" feature from Google,
which could be implemented as a hidden Markov model, where the observed data is what we
type into the computer, and the hidden data is what we meant to type. Graphical
models are frequently used in these sorts of probabilistic approaches to machine learning,
pattern recognition, and artificial intelligence (see [7] for an introduction).

Inference functions for graphical models are also important in computational biology [11,
Section 1.5], from where we originally drew inspiration for this paper. For example, consider
the gene-finding functions, which were discussed in [13, Section 5]. These inference functions
(corresponding to a particular HMM) are used to identify gene structures in DNA sequences.
An observation in such a model is a sequence of nucleotides in the alphabet $\Sigma' = \{A, C, G, T\}$,
and an explanation is a sequence of 1's and 0's which indicate whether the particular
nucleotide is in a gene or is not. We seek to use the information in the observed data (which
we can find via DNA sequencing) to decide on the hidden information of which nucleotides
are part of genes (which is hard to figure out directly). Another class of examples is that
of sequence alignment models [11, Section 2.2]. In such models, an inference function is a
map from a pair of DNA sequences to an optimal alignment of those sequences. If we change 
the parameters of the model, which alignments are optimal may change, and so the inference 
functions may change. 

A surprising conclusion of this paper is that there cannot be too many different inference
functions, though the parameters may vary continuously over all possible choices. For example,
in the homogeneous binary HMM of length 5 (see Section 2.1 for some definitions; they
are not important at the moment), the observed data is a binary sequence of length 5, and
the explanation will also be a binary sequence of length 5. At first glance, there are

\[ 32^{32} = 1\,461\,501\,637\,330\,902\,918\,203\,684\,832\,716\,283\,019\,655\,932\,542\,976 \]

possible maps from observed sequences to explanations. In fact, Christophe Weibel has
computed that only 5266 of these possible maps are actually inference functions [?]. Indeed,
for an arbitrary graphical model, the number of possible maps from observed sequences to
explanations is, at first glance, doubly exponential in the size of the model. The following
theorem, which we call the Few Inference Functions Theorem, states that, if we fix the number
of parameters, the number of inference functions is actually bounded by a polynomial in the
size of the model.

Theorem 1 (The Few Inference Functions Theorem). Let d be a fixed positive integer.
Consider a graphical model with d parameters (see Definitions 3 and 5 for directed and undirected
graphs, respectively). Let M be the complexity of the graphical model, where complexity is
given by Definitions 4 and 6, respectively. Then, the number of inference functions of the
model is $O(M^{d(d-1)})$.

As we shall see, the complexity of a graphical model is often linear in the number of vertices 
or edges of the underlying graph. 

Different inference functions represent different criteria to decide what is the most likely
explanation for each observation. A bound on the number of inference functions is important
because it indicates how badly a model may respond to changes in the parameter values
(which are generally known with very little certainty and only guessed at). Also, the
polynomial bound given in Section 3 suggests that it might be feasible to precompute all the
inference functions of a given graphical model, which would yield an efficient way to provide
an explanation for each given observation.

This paper is structured as follows. In Section 2 we introduce some preliminaries about
graphical models and inference functions, as well as some facts about polytopes. In Section 3
we prove Theorem 1. In Section 4 we prove that our upper bound on the number of inference
functions of a graphical model is sharp, up to a constant factor, by constructing a family of
HMMs whose number of inference functions asymptotically matches the bound. In Section 5
we show that the bound is also asymptotically tight on a model for sequence alignment which
is actually used in computational biology. In particular, this bound will be quadratic in the
length of the input DNA sequences. We conclude with a few remarks and possible directions
for further research.
2. Preliminaries



2.1. Graphical models. A statistical model is a family of joint probability distributions for
a collection of discrete random variables $W = (W_1, \ldots, W_m)$, where each $W_i$ takes on values
in some finite state space $\Sigma_i$. A graphical model is represented by a graph G where each vertex
$v_i$ corresponds to a random variable $W_i$. The edges of the graph represent the dependencies
between the variables. There are two major classes of graphical models depending on whether
G is a directed or an undirected graph.

We start by discussing directed graphical models, also called Bayesian networks, which are
those represented by a finite directed acyclic graph G. Each vertex $v_i$ has an associated
probability map

\[ p_i : \prod_{j \,:\, v_j \text{ a parent of } v_i} \Sigma_j \longrightarrow [0,1]^{|\Sigma_i|}. \tag{1} \]

Given the states of each $W_j$ such that $v_j$ is a parent of $v_i$, the probability that $v_i$ has a given
state is independent of all other vertices that are not descendants of $v_i$, and this map $p_i$ gives
that probability. In particular, we have the equality

\[ \mathrm{Prob}(W = \rho) = \prod_i \mathrm{Prob}\bigl(W_i = \rho_i, \text{ given that } W_j = \rho_j \text{ for all parents } v_j \text{ of } v_i\bigr) = \prod_i \bigl[ p_i(\rho_{j_1}, \ldots, \rho_{j_k}) \bigr]_{\rho_i}, \]

where $v_{j_1}, \ldots, v_{j_k}$ are the parents of $v_i$. Sources in the digraph (which have no parents) are
generally given the uniform probability distribution on their states, though more general
distributions are possible. See [11, Section 1.5] for general background on graphical models.

Example 2. The hidden Markov model (HMM) is a model with random variables
$X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$. Edges go from $X_i$ to $X_{i+1}$ and from $X_i$ to $Y_i$.

Figure 1. The graph of an HMM for n = 3.

Generally, each $X_i$ has the same state space $\Sigma$ and each $Y_i$ has the same state space $\Sigma'$. An
HMM is called homogeneous if the $p_{X_i}$, for $1 < i \le n$, are identical and the $p_{Y_i}$ are identical.
In this case, the $p_{X_i}$ each correspond to the same $|\Sigma| \times |\Sigma|$ matrix $T = (t_{ij})$ (the transition
matrix) and the $p_{Y_i}$ each correspond to the same $|\Sigma| \times |\Sigma'|$ matrix $S = (s_{ij})$ (the emission
matrix).

In the example, we have partitioned the variables into two sets. In general graphical models,
we also have two kinds of variables: observed variables $Y = (Y_1, Y_2, \ldots, Y_n)$ and hidden
variables $X = (X_1, X_2, \ldots, X_q)$. Generally, the observed variables are the sinks of the directed
graph, and the hidden variables are the other vertices, but this does not need to be the case.
To simplify the notation, we make the assumption, which is often the case in practice, that
all the observed variables take their values in the same finite alphabet $\Sigma'$ and that all the
hidden variables are on the finite alphabet $\Sigma$.

Notice that for given $\Sigma$ and $\Sigma'$ the homogeneous HMMs in this example depend only on a
fixed set of parameters, $t_{ij}$ and $s_{ij}$, even as n gets large. These are the sorts of models we
are interested in.

Definition 3. A directed graphical model with d parameters, $\theta_1, \ldots, \theta_d$, is a directed
graphical model such that each probability $[p_i(\rho_{j_1}, \ldots, \rho_{j_k})]_{\rho_i}$ in (1) is a monomial in $\theta_1, \ldots, \theta_d$.

In what follows we denote by E the number of edges of the underlying graph of a graphical
model, by n the number of observed random variables, and by q the number of hidden random
variables. The observations, then, are sequences in $(\Sigma')^n$ and the explanations are sequences
in $\Sigma^q$. Let $l = |\Sigma|$ and $l' = |\Sigma'|$.

For each observation $\tau$ and hidden variables h, $\mathrm{Prob}(X = h, Y = \tau)$ is a monomial $f_{h,\tau}$
in the parameters $\theta_1, \ldots, \theta_d$. Then for each observation $\tau \in (\Sigma')^n$, the observed probability
$\mathrm{Prob}(Y = \tau)$ is the sum over all hidden data h of $\mathrm{Prob}(X = h, Y = \tau)$, and so $\mathrm{Prob}(Y = \tau)$
is the polynomial $f_\tau = \sum_h f_{h,\tau}$ in the parameters $\theta_1, \ldots, \theta_d$.

Definition 4. The complexity, M, of a directed graphical model is the maximum, over all
$\tau$, of the degree of the polynomial $f_\tau$.

In many graphical models, M will be a linear function of n, the number of observed variables.
For example, in the homogeneous HMM, $M = E = 2n - 1$.
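
The claim $M = 2n - 1$ for the homogeneous HMM can be checked by brute force on small cases. The following is a minimal sketch, assuming a binary HMM whose parameters are the entries $t_{ij}$ and $s_{ij}$ (treated as free parameters, with the uniform initial distribution contributing only a constant). The function names are illustrative and not from the paper.

```python
from itertools import product

def monomial_exponents(hidden, observed, n_states=2, n_symbols=2):
    """Exponent vector of the monomial f_{h,tau} in the parameters
    t_ij (transitions) followed by s_ij (emissions)."""
    t = [[0] * n_states for _ in range(n_states)]
    s = [[0] * n_symbols for _ in range(n_states)]
    for i, (h, y) in enumerate(zip(hidden, observed)):
        s[h][y] += 1                      # edge X_i -> Y_i
        if i > 0:
            t[hidden[i - 1]][h] += 1      # edge X_{i-1} -> X_i
    return tuple(e for row in t + s for e in row)

def complexity(n, n_states=2, n_symbols=2):
    """Maximum degree of f_tau over all observations tau (Definition 4)."""
    return max(
        sum(monomial_exponents(h, y, n_states, n_symbols))
        for y in product(range(n_symbols), repeat=n)
        for h in product(range(n_states), repeat=n)
    )
```

Every monomial picks up one factor per edge of the graph, which is why the degree equals the number of edges E.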

Note that we have not assumed that the appropriate probabilities sum to 1. It turns out 
that the analysis is much easier if we do not place that restriction on our probabilities. At 
the end of the analysis, these restrictions may be added if desired (there are many models 
in use, however, which never place that restriction; these can no longer be properly called 
"probabilistic" models, but in fact belong to a more general class of "scoring" models which 
our analysis also encompasses). 

The other class of graphical models are those that are represented by an undirected graph. 
They are called undirected graphical models and are also known as Markov random fields. As 
for directed models, the vertices of the graph G correspond to the random variables, but the 
joint probability is now represented as a product of local functions defined on the maximal 
cliques of the graph, instead of transition probabilities $p_i$ defined on the edges.

Recall that a clique of a graph is a set of vertices with the property that there is an edge 
between any two of them. A clique is maximal if it cannot be extended to include additional 
vertices without losing the property of being a clique (see Figure 2).

Each maximal clique C of the graph G has an associated potential function

\[ \psi_C : \prod_{j \,:\, v_j \in C} \Sigma_j \longrightarrow \mathbb{R}_{\ge 0}. \tag{2} \]

Figure 2. An undirected graph with maximal cliques $\{v_1, v_2\}$, $\{v_2, v_3\}$,
$\{v_2, v_4, v_5\}$, $\{v_3, v_6\}$, and $\{v_5, v_6\}$.

Given the states $\rho_j$ of each $W_j$ such that $v_j$ is a vertex in the clique C, if we denote by $\rho_C$
the vector of such states, then $\psi_C(\rho_C)$ is a nonnegative real number. We denote by $\mathcal{C}$ the set
of all maximal cliques C.

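Then, the joint probability distribution of all the variables $W_i$ is given by

\[ \mathrm{Prob}(W = \rho) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(\rho_C), \]

where Z is the normalization factor

\[ Z = \sum_{\rho} \prod_{C \in \mathcal{C}} \psi_C(\rho_C), \]

obtained by summing over all assignments of values to the variables $\rho$.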

The value of the function $\psi_C(\rho_C)$ for each possible choice of the states $\rho_i$ is given by the
parameters of the model. We will be interested in models in which the set of parameters is
fixed, even as the size of the graph gets large.

Definition 5. An undirected graphical model with d parameters, $\theta_1, \ldots, \theta_d$, is an undirected
graphical model such that each probability $\psi_C(\rho_C)$ in (2) is a monomial in $\theta_1, \ldots, \theta_d$.

As in the case of directed models, the variables can be partitioned into observed variables
$Y = (Y_1, Y_2, \ldots, Y_n)$ (which can be assumed to take their values in the same finite alphabet
$\Sigma'$) and hidden variables $X = (X_1, X_2, \ldots, X_q)$ (which can be assumed to be on the finite
alphabet $\Sigma$). For each observation $\tau$ and hidden variables h, $Z \cdot \mathrm{Prob}(X = h, Y = \tau)$ is a
monomial $f_{h,\tau}$ in the parameters $\theta_1, \ldots, \theta_d$. Then for each observation $\tau \in (\Sigma')^n$, the observed
probability $\mathrm{Prob}(Y = \tau)$ is the sum over all hidden data h of $\mathrm{Prob}(X = h, Y = \tau)$, and so
$Z \cdot \mathrm{Prob}(Y = \tau)$ is the polynomial $f_\tau = \sum_h f_{h,\tau}$ in the parameters $\theta_1, \ldots, \theta_d$.

Definition 6. The complexity, M, of an undirected graphical model is the maximum, over
all $\tau$, of the degree of the polynomial $f_\tau$.

It is usually the case for undirected models, as in directed, that M is a linear function of n. 

2.2. Inference functions. For fixed values of the parameters, the basic inference problem is
to determine, for each given observation $\tau$, the value $h \in \Sigma^q$ of the hidden data that maximizes
$\mathrm{Prob}(X = h \mid Y = \tau)$. A solution to this optimization problem is denoted $\hat{h}$ and is called an
explanation of the observation $\tau$. Each choice of parameter values $(\theta_1, \theta_2, \ldots, \theta_d)$ defines an
inference function $\tau \mapsto \hat{h}$ from the set of observations $(\Sigma')^n$ to the set of explanations $\Sigma^q$.

It is possible that there is more than one value of $\hat{h}$ attaining the maximum of
$\mathrm{Prob}(X = h \mid Y = \tau)$. In this case, for simplicity, we will pick only one such explanation, according to
some consistent tie-breaking rule decided ahead of time. For example, we can pick the least
such $\hat{h}$ in some given total order of the set $\Sigma^q$ of hidden states. Another alternative would
be to define inference functions as maps from $(\Sigma')^n$ to subsets of $\Sigma^q$. This would not affect
the results of this paper, so for the sake of simplicity, we consider only inference functions as
defined above.
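
For very small models the inference function can be tabulated by brute force, which makes the definition and the tie-breaking rule concrete. The following is a minimal sketch for a homogeneous HMM, assuming states and symbols are numbered from 0 and ties are broken by the lexicographically least hidden sequence. The function names are illustrative, and the exhaustive search is exponential in n, so this is not a practical algorithm.

```python
from itertools import product

def score(hidden, observed, T, S):
    """Unnormalized Prob(X = hidden, Y = observed) for a homogeneous HMM
    with transition matrix T and emission matrix S (uniform start)."""
    p = S[hidden[0]][observed[0]]
    for prev, cur, y in zip(hidden, hidden[1:], observed[1:]):
        p *= T[prev][cur] * S[cur][y]
    return p

def inference_function(n, T, S):
    """Table mapping every observation to its explanation."""
    n_states, n_symbols = len(T), len(S[0])
    table = {}
    for obs in product(range(n_symbols), repeat=n):
        # product() enumerates hidden sequences in lexicographic order and
        # max() returns the first maximizer, so ties go to the least one.
        table[obs] = max(product(range(n_states), repeat=n),
                         key=lambda h: score(h, obs, T, S))
    return table
```

Two parameter choices define the same inference function exactly when the resulting tables are equal.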

It is interesting to observe that the total number of maps $(\Sigma')^n \to \Sigma^q$ is $(l^q)^{(l')^n} = l^{q (l')^n}$,
which is doubly-exponential in the length n of the observations. However, the vast majority
of these maps are not inference functions for any values of the parameters. Before our results,
the best upper bound in the literature is an exponential bound given in [14, Corollary 10].
Theorem 1 gives a polynomial upper bound on the number of inference functions of a graphical
model.
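
As a quick arithmetic check of this count, the sketch below evaluates $(l^q)^{(l')^n}$ with exact integers. The function name is illustrative. For the binary HMM of length 5 it gives $32^{32} = 2^{160}$.

```python
def number_of_maps(l, l_prime, q, n):
    """Number of maps from (Sigma')^n to Sigma^q, where |Sigma| = l
    and |Sigma'| = l_prime: (l^q)^((l')^n)."""
    return (l ** q) ** (l_prime ** n)
```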

2.3. Polytopes. Here we review some facts about convex polytopes, and we introduce some 
notation. Recall that a polytope is a bounded intersection of finitely many closed halfspaces, 
or equivalently, the convex hull of a finite set of points. For the basic definitions about 
polytopes we refer the reader to [16].

Given a polynomial $f(\theta) = \sum_{i=1}^{N} \theta_1^{a_{1,i}} \theta_2^{a_{2,i}} \cdots \theta_d^{a_{d,i}}$, its Newton polytope, denoted by NP(f), is
defined as the convex hull in $\mathbb{R}^d$ of the set of points $\{(a_{1,i}, a_{2,i}, \ldots, a_{d,i}) : i = 1, \ldots, N\}$.

For example, if f{9 1 ,9 2 ) = 20? + W\Q\ + 0i0| + 30i + 50|, then its Newton polytope NP(/) 
is given in Figure 3.




Figure 3. The Newton polytope of /(0i,0 2 ) = 20? + 30f0| + 0i0| + 30 x + 50|. 
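
In two parameters the Newton polytope is a polygon and can be computed with a standard planar convex hull. The sketch below uses Andrew's monotone chain algorithm, a technique chosen here for illustration and not taken from the paper. The exponent list in the usage note belongs to a hypothetical polynomial, not the one in Figure 3.

```python
def cross(o, a, b):
    """Signed area test: positive if o -> a -> b turns counterclockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def newton_polygon(exponents):
    """Vertices of the convex hull of the exponent vectors,
    in counterclockwise order (monotone chain)."""
    pts = sorted(set(exponents))
    if len(pts) <= 2:
        return pts
    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]
    return half(pts) + half(reversed(pts))

def maximal_vertex(vertices, w):
    """A vertex at which the linear functional x -> x . w is maximized."""
    return max(vertices, key=lambda x: x[0] * w[0] + x[1] * w[1])
```

For the hypothetical polynomial $1 + \theta_1^2 + \theta_2^2 + \theta_1\theta_2 + \theta_1^2\theta_2^2$ the exponent vectors are (0,0), (2,0), (0,2), (1,1), (2,2). The point (1,1) lies in the interior, so the Newton polygon is a square.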

Given a polytope $P \subset \mathbb{R}^d$ and a vector $w \in \mathbb{R}^d$, the set of all points in P at which the linear
functional $x \mapsto x \cdot w$ attains its maximum determines a face of P. It is denoted

\[ \mathrm{face}_w(P) = \{ x \in P : x \cdot w \ge y \cdot w \text{ for all } y \in P \}. \tag{3} \]

Faces of dimension 0 (consisting of a single point) are called vertices, and faces of dimension
1 are called edges. If d is the dimension of the polytope, then faces of dimension d - 1 are
called facets.

Let P be a polytope and F a face of P. The normal cone of P at F is

\[ N_P(F) = \{ w \in \mathbb{R}^d : \mathrm{face}_w(P) = F \}. \]

The collection of all cones $N_P(F)$ as F runs over all faces of P is denoted $\mathcal{N}(P)$ and is called
the normal fan of P. Thus the normal fan $\mathcal{N}(P)$ is a partition of $\mathbb{R}^d$ into cones. The cones
in $\mathcal{N}(P)$ are in bijection with the faces of P, and if $w \in N_P(F)$ then the linear functional
$x \mapsto w \cdot x$ is maximized on F. Figure 4 shows the normal fan of the polytope from Figure 3.




Figure 4. The normal fan of a polytope.



The Minkowski sum of two polytopes P and P' is defined as

\[ P + P' := \{ x + x' : x \in P,\ x' \in P' \}. \]

Figure 5 shows an example in 2 dimensions. The Newton polytope of the map
$\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^{(l')^n}$
is defined as the Minkowski sum of the individual Newton polytopes of its coordinates, namely

\[ \mathrm{NP}(\mathbf{f}) := \sum_{\tau \in (\Sigma')^n} \mathrm{NP}(f_\tau). \]






Figure 5. Two polytopes and their Minkowski sum. 

The common refinement of two or more normal fans is the collection of cones obtained as
the intersection of a cone from each of the individual fans. For polytopes $P_1, P_2, \ldots, P_k$,
the common refinement of their normal fans is denoted $\mathcal{N}(P_1) \wedge \cdots \wedge \mathcal{N}(P_k)$. The following
lemma states the well-known fact that the normal fan of a Minkowski sum of polytopes is the
common refinement of their individual fans (see [16, Proposition 7.12] or [4, Lemma 2.1.5]):

Lemma 7. $\mathcal{N}(P_1 + \cdots + P_k) = \mathcal{N}(P_1) \wedge \cdots \wedge \mathcal{N}(P_k)$.



We finish with a result of Gritzmann and Sturmfels that will be useful later. It gives a bound 
on the number of vertices of a Minkowski sum of polytopes. 
Theorem 8 (Gritzmann and Sturmfels [?]). Let $P_1, P_2, \ldots, P_k$ be polytopes in $\mathbb{R}^d$, and let m denote the number of
non-parallel edges of $P_1, \ldots, P_k$. Then the number of vertices of $P_1 + \cdots + P_k$ is at most

\[ 2 \sum_{j=0}^{d-1} \binom{m-1}{j}. \]

Note that this bound is independent of the number k of polytopes.
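
A small planar example illustrates the Minkowski sum and the vertex bound $2\sum_{j=0}^{d-1}\binom{m-1}{j}$. The sketch below is an illustration under stated assumptions: the two polygons are arbitrary examples with integer vertices, the sum is computed naively as the convex hull of all pairwise vertex sums, and the helper names are not from the paper.

```python
from math import comb, gcd

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counterclockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]
    return half(pts) + half(reversed(pts))

def minkowski_sum(P, Q):
    """Vertices of P + Q for convex polygons given by their vertices."""
    return convex_hull([(p[0] + q[0], p[1] + q[1]) for p in P for q in Q])

def edge_directions(P):
    """Edge directions of a polygon (vertices in cyclic order),
    normalized so that parallel edges give the same direction."""
    dirs = set()
    for (x0, y0), (x1, y1) in zip(P, P[1:] + P[:1]):
        dx, dy = x1 - x0, y1 - y0
        g = gcd(dx, dy)
        dx, dy = dx // g, dy // g
        if dx < 0 or (dx == 0 and dy < 0):
            dx, dy = -dx, -dy
        dirs.add((dx, dy))
    return dirs

def vertex_bound(m, d):
    """Upper bound of Theorem 8 for m non-parallel edges in dimension d."""
    return 2 * sum(comb(m - 1, j) for j in range(d))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (1, 0), (0, 1)]
m = len(edge_directions(square) | edge_directions(triangle))
total = minkowski_sum(square, triangle)
```

Here the square and the triangle share two edge directions, so m = 3 and the bound is 6, while the sum is a pentagon.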

3. An upper bound on the number of inference functions

For fixed parameters, the inference problem of finding the explanation $\hat{h}$ that maximizes
$\mathrm{Prob}(X = h \mid Y = \tau)$ is equivalent to identifying the monomial $f_{h,\tau} = \theta_1^{a_{1,i}} \theta_2^{a_{2,i}} \cdots \theta_d^{a_{d,i}}$ of $f_\tau$
with maximum value. Since the logarithm is a monotonically increasing function, the desired
monomial also maximizes the quantity

\[ \log\bigl(\theta_1^{a_{1,i}} \theta_2^{a_{2,i}} \cdots \theta_d^{a_{d,i}}\bigr) = a_{1,i} \log(\theta_1) + a_{2,i} \log(\theta_2) + \cdots + a_{d,i} \log(\theta_d) = a_{1,i} v_1 + a_{2,i} v_2 + \cdots + a_{d,i} v_d, \]

where we replace $\log(\theta_j)$ with $v_j$. This is equivalent to the fact that the corresponding point
$(a_{1,i}, a_{2,i}, \ldots, a_{d,i})$ maximizes the linear expression $v_1 x_1 + \cdots + v_d x_d$ on the Newton polytope
$\mathrm{NP}(f_\tau)$. Thus, the inference problem for fixed parameters becomes a linear programming
problem.
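
The equivalence between maximizing a monomial and maximizing a linear functional on exponent vectors can be seen numerically. The sketch below assumes strictly positive parameter values (so that the logarithms exist) and uses a hypothetical list of exponent vectors. The function names are illustrative.

```python
import math

def best_monomial(exponents, theta):
    """Index of the monomial theta^a with the largest value."""
    return max(range(len(exponents)),
               key=lambda i: math.prod(t ** a for t, a in zip(theta, exponents[i])))

def best_point(exponents, theta):
    """Index of the exponent vector maximizing v . a, with v_j = log(theta_j)."""
    v = [math.log(t) for t in theta]
    return max(range(len(exponents)),
               key=lambda i: sum(vj * a for vj, a in zip(v, exponents[i])))
```

Both functions select the same index whenever the maximum is attained by a single monomial; with ties, the result depends on the tie-breaking rule, as discussed in Section 2.2.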

Each choice of the parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ determines an inference function. If
$v = (v_1, v_2, \ldots, v_d)$ is the vector in $\mathbb{R}^d$ with coordinates $v_i = \log(\theta_i)$, then we denote the
corresponding inference function by

\[ \Phi_v : (\Sigma')^n \longrightarrow \Sigma^q. \]

For each observation $\tau \in (\Sigma')^n$, its explanation $\Phi_v(\tau)$ is given by the vertex of $\mathrm{NP}(f_\tau)$ that
is maximal in the direction of the vector v. Note that for certain values of the parameters
(if v is perpendicular to a positive-dimensional face of $\mathrm{NP}(f_\tau)$) there may be more than one
vertex attaining the maximum. It is also possible that a single point $(a_{1,i}, a_{2,i}, \ldots, a_{d,i})$ in the
polytope corresponds to several different values of the hidden data. In both cases, we pick the
explanation according to the tie-breaking rule determined ahead of time. This simplification
does not affect the asymptotic number of inference functions.

Different values of $\theta$ yield different directions v, which can result in distinct inference
functions. We are interested in bounding the number of different inference functions that a
graphical model can have. Theorem 1 gives an upper bound which is polynomial in the size
of the graphical model. In other words, extremely few of the $l^{q (l')^n}$ functions $(\Sigma')^n \to \Sigma^q$
are actually inference functions.

We use the notation $f(n) \in O(g(n))$ to indicate that $\limsup_{n \to \infty} |f(n)/g(n)| < \infty$. Similarly
$f(n) \in \Omega(g(n))$ means that $\liminf_{n \to \infty} |f(n)/g(n)| > 0$, and $f(n) \in \Theta(g(n))$ denotes that
$f(n)$ belongs to both $O(g(n))$ and $\Omega(g(n))$.

Before proving Theorem 1, observe that usually M, the complexity of the graphical model, is
linear in n. For example, in the case of directed models, consider the common situation where
M is bounded by E, the number of edges of the underlying graph (this happens when each
edge "contributes" at most degree 1 to the monomials $f_{h,\tau}$, as in the homogeneous HMM).
In most graphical models of interest, E is a linear function of n, so the bound becomes
$O(n^{d(d-1)})$. For example, the homogeneous HMM has $M = E = 2n - 1$.

In the case of undirected models, if each $\psi_C(\rho_C)$ is a parameter of the model, then
$f_{h,\tau} = Z \cdot \mathrm{Prob}(X = h, Y = \tau)$ is a product of potential functions for each maximal clique of the
graph, so M is bounded by the number of maximal cliques, which in many cases is also a
linear function of the number of vertices of the graph. For example, this is the situation
in language models where each word depends on a fixed number of previous words in the
sentence.

Proof. In the first part of the proof we will reduce the problem of counting inference functions
to the enumeration of the vertices of a certain polytope. We have seen that an inference
function is specified by a choice of the parameters, which is equivalent to choosing a vector
$v \in \mathbb{R}^d$. The function is denoted $\Phi_v : (\Sigma')^n \to \Sigma^q$, and the explanation $\Phi_v(\tau)$ of a given
observation $\tau$ is determined by the vertex of $\mathrm{NP}(f_\tau)$ that is maximal in the direction of v.
Thus, cones of the normal fan $\mathcal{N}(\mathrm{NP}(f_\tau))$ correspond to sets of vectors v that give rise to
the same explanation for the observation $\tau$. Non-maximal cones (i.e., those contained in
another cone of higher dimension) correspond to directions v for which more than one vertex
is maximal. Since ties are broken using a consistent rule, we disregard this case for simplicity.
Thus, in what follows we consider only maximal cones of the normal fan.

Let $v' = (v'_1, v'_2, \ldots, v'_d)$ be another vector corresponding to a different choice of parameters
(see Figure 6). By the above reasoning, $\Phi_v(\tau) = \Phi_{v'}(\tau)$ if and only if v and v' belong to the
same cone of $\mathcal{N}(\mathrm{NP}(f_\tau))$. Thus, $\Phi_v$ and $\Phi_{v'}$ are the same inference function if and only if v
and v' belong to the same cone of $\mathcal{N}(\mathrm{NP}(f_\tau))$ for all observations $\tau \in (\Sigma')^n$. Consider the
common refinement of all these normal fans, $\bigwedge_{\tau \in (\Sigma')^n} \mathcal{N}(\mathrm{NP}(f_\tau))$. Then, $\Phi_v$ and $\Phi_{v'}$ are the
same function exactly when v and v' lie in the same cone of this common refinement.

This implies that the number of inference functions equals the number of cones in

\[ \bigwedge_{\tau \in (\Sigma')^n} \mathcal{N}(\mathrm{NP}(f_\tau)). \]

By Lemma 7, this common refinement is the normal fan of $\mathrm{NP}(\mathbf{f}) = \sum_{\tau \in (\Sigma')^n} \mathrm{NP}(f_\tau)$, the
Minkowski sum of the polytopes $\mathrm{NP}(f_\tau)$ for all observations $\tau$. It follows that enumerating
inference functions is equivalent to counting vertices of $\mathrm{NP}(\mathbf{f})$. In the remaining part of the
proof we give an upper bound on the number of vertices of $\mathrm{NP}(\mathbf{f})$.

Note that for each $\tau$, the polytope $\mathrm{NP}(f_\tau)$ is contained in the hypercube $[0, M]^d$, since by
definition of M, each parameter $\theta_i$ appears in $f_\tau$ with exponent at most M. Also, the
vertices of $\mathrm{NP}(f_\tau)$ have integral coordinates, because they are exponent vectors. Polytopes
whose vertices have integral coordinates are called lattice polytopes. It follows that the edges
of $\mathrm{NP}(f_\tau)$ are given by vectors where each coordinate is an integer between $-M$ and $M$.
There are only $(2M + 1)^d$ such vectors, so this is an upper bound on the number of different
directions that the edges of the polytopes $\mathrm{NP}(f_\tau)$ can have.

This property of the Newton polytopes of the coordinates of the model will allow us to give
an upper bound on the number of vertices of their Minkowski sum $\mathrm{NP}(\mathbf{f})$. The last ingredient
that we need is Theorem 8. In our case we have a sum of polytopes $\mathrm{NP}(f_\tau)$, one for each
observation $\tau \in (\Sigma')^n$, having at most $(2M + 1)^d$ non-parallel edges in total. Hence, by
Theorem 8 the number of vertices of $\mathrm{NP}(\mathbf{f})$ is at most

\[ 2 \sum_{j=0}^{d-1} \binom{(2M+1)^d - 1}{j}. \]

As M goes to infinity, the dominant term of this expression is

\[ \frac{2^{d^2 - d + 1}}{(d-1)!} \, M^{d(d-1)}. \]

Thus, we get an $O(M^{d(d-1)})$ upper bound on the number of inference functions of the
graphical model. □

Figure 6. Two different inference functions, $\Phi_v$ (left column) and $\Phi_{v'}$ (right
column). Each row corresponds to a different observation. The respective
explanations are given by the marked vertices in each Newton polytope.
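
The bound from the proof, $2\sum_{j=0}^{d-1}\binom{(2M+1)^d - 1}{j}$, can be evaluated exactly and compared with its leading term $2^{d^2-d+1} M^{d(d-1)}/(d-1)!$, which follows from $\binom{m-1}{d-1} \approx m^{d-1}/(d-1)!$ with $m = (2M+1)^d$. The sketch below is a numerical check only, and the function names are illustrative.

```python
from math import comb, factorial

def inference_function_bound(M, d):
    """Upper bound on the number of vertices of NP(f):
    Theorem 8 applied with m = (2M + 1)^d edge directions."""
    m = (2 * M + 1) ** d
    return 2 * sum(comb(m - 1, j) for j in range(d))

def dominant_term(M, d):
    """Leading term of the bound as M grows, for fixed d."""
    return 2 ** (d * d - d + 1) * M ** (d * (d - 1)) / factorial(d - 1)
```

For d = 2 the bound simplifies to $2(2M+1)^2 = 8M^2 + 8M + 2$, whose leading term is $8M^2$.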



In the next section we will show that the bound given in Theorem 1 is tight up to a constant
factor.



4. A lower bound

As before, we fix d, the number of parameters in our model. The Few Inference Functions
Theorem tells us that the number of inference functions is bounded from above by some
function $c M^{d(d-1)}$, where c is a constant (depending only on d) and M is the complexity
of the model. Here we show that this bound is tight up to a constant, by constructing a
family of graphical models whose number of inference functions is at least $c' M^{d(d-1)}$, where
$c'$ is another constant. In fact, we will construct a family of hidden Markov models with this
property. To be precise, we have the following theorem.
Theorem 9. Fix d. There is a constant $c' = c'(d)$ such that, given $n \in \mathbb{Z}_+$, there exists
an HMM of length n, with d parameters, 4d + 4 hidden states, and 2 observed states, such
that there are at least $c' n^{d(d-1)}$ distinct inference functions. (For this HMM, M is a linear
function of n, so this also gives us the lower bound in terms of M.)

In Section 4.1 we prove Theorem 9. This proof requires several lemmas that we will meet
along the way, and these lemmas will be proved in Section 4.2. Lemma 13, which is interesting
in its own right as a statement in the geometry of numbers, is proved in [3].

4.1. Proof of Theorem 9. Given n, we first construct the appropriate HMM, $\mathcal{M}_n$, using
the following lemma.

Lemma 10. Given $n \in \mathbb{Z}_+$, there is an HMM, $\mathcal{M}_n$, of length n, with d parameters, 4d + 4
hidden states, and 2 observed states, such that for any $a \in \mathbb{Z}_+^d$ with $\sum_i a_i < n$, there is an
observed sequence which has one explanation if

\[ a_1 \log(\theta_1) + \cdots + a_d \log(\theta_d) > 0 \]

and another explanation if $a_1 \log(\theta_1) + \cdots + a_d \log(\theta_d) < 0$.

This means that, for the HMM $\mathcal{M}_n$, the decomposition of (log-)parameter space into inference
cones includes all of the hyperplanes $\{x : \langle a, x \rangle = 0\}$ such that $a \in \mathbb{Z}_+^d$ with $\sum_i a_i < n$. Call
the arrangement of these hyperplanes $\mathcal{H}_n$. It suffices to show that the arrangement $\mathcal{H}_n$ consists
of at least $c' n^{d(d-1)}$ chambers (full dimensional cones determined by the arrangement). There
are $c_1 n^d$ ways to choose one of the hyperplanes from $\mathcal{H}_n$, for some constant $c_1$. Therefore
there are $c_1^{d-1} n^{d(d-1)}$ ways to choose d - 1 of the hyperplanes; their intersection is, in general,
a 1-dimensional face of $\mathcal{H}_n$ (that is, the intersection is a ray which is an extreme ray for
the cones it is contained in). It is quite possible that two different ways of choosing d - 1
hyperplanes give the same extreme ray. The following lemma says that some constant fraction
of these choices of extreme rays are actually distinct.

Lemma 11. Fix d. Given n, let $\mathcal{H}_n$ be the hyperplane arrangement consisting of the
hyperplanes of the form $\{x : \langle a, x \rangle = 0\}$ with $a \in \mathbb{Z}_+^d$ and $\sum_i a_i < n$. Then the number of
1-dimensional faces of $\mathcal{H}_n$ is at least $c_2 n^{d(d-1)}$, for some constant $c_2$.

Each chamber will have a number of these extreme rays on its boundary. The following
lemma gives a constant bound on this number.

Lemma 12. Fix d. Given n, define $\mathcal{H}_n$ as above. Each chamber of $\mathcal{H}_n$ has at most $2^{d(d-1)}$
extreme rays.

Conversely, each ray is an extreme ray for at least 1 chamber. Therefore there are at least
$\frac{c_2}{2^{d(d-1)}} n^{d(d-1)}$ chambers, and Theorem 9 is proved. □
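
For d = 2 the arrangement can be counted directly, which gives a sanity check on the quadratic growth $n^{d(d-1)} = n^2$. In the plane, each hyperplane $\{x : \langle a, x \rangle = 0\}$ is a line through the origin, two vectors a give the same line exactly when they are proportional, and k distinct lines through the origin cut the plane into 2k chambers. The sketch below assumes that $\mathbb{Z}_+$ means the strictly positive integers. The function name is illustrative.

```python
from math import gcd

def chambers_d2(n):
    """Number of chambers of the arrangement H_n for d = 2:
    twice the number of distinct lines {x : a . x = 0} with
    a_1, a_2 >= 1 and a_1 + a_2 < n."""
    lines = {(a1 // gcd(a1, a2), a2 // gcd(a1, a2))
             for a1 in range(1, n)
             for a2 in range(1, n - a1)}
    return 2 * len(lines)
```

The number of distinct lines is the number of coprime pairs with $a_1 + a_2 < n$, which grows like $3n^2/\pi^2$, so the chamber count is of order $n^2$.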



In proving Lemma 11, we will need one more lemma. This lemma is interesting in its own
right as a probabilistic statement about integer lattices, and so is proved in a companion
paper [3]. Given a set $S \subset \mathbb{Z}^d$ of integer vectors, $\mathrm{span}_{\mathbb{R}}(S)$ is a linear subspace of $\mathbb{R}^d$ and
$\mathrm{span}_{\mathbb{R}}(S) \cap \mathbb{Z}^d$ is a sublattice of $\mathbb{Z}^d$. We say that S is primitive if S is a $\mathbb{Z}$-basis for the lattice
$\mathrm{span}_{\mathbb{R}}(S) \cap \mathbb{Z}^d$. Equivalently, a set S is primitive if and only if it may be extended to a $\mathbb{Z}$-basis
of all of $\mathbb{Z}^d$ (see [?]).

We imagine picking each vector in S uniformly at random from some large box in $\mathbb{R}^d$. As the
size of the box approaches infinity, the following lemma will tell us that the probability that
S is primitive approaches

\[ \frac{1}{\zeta(d)\,\zeta(d-1) \cdots \zeta(d-m+1)}, \]

where $|S| = m$ and $\zeta(a)$ is the Riemann zeta function $\sum_{i=1}^{\infty} \frac{1}{i^a}$.

Lemma 13 (from [3]). Let d and m be given, with $m < d$. For $n \in \mathbb{Z}_+$, $1 \le k \le m$,
and $1 \le i \le d$, let $b_{n,k,i} \in \mathbb{Z}$. For a given n, choose integers $s_{ki}$ uniformly (and
independently) at random from the set $b_{n,k,i} \le s_{ki} \le b_{n,k,i} + n$. Let $s_k = (s_{k1}, \ldots, s_{kd})$ and let
$S = \{s_1, s_2, \ldots, s_m\}$.

If $|b_{n,k,i}|$ is bounded by a polynomial in n, then, as n approaches infinity, the probability that
S is a primitive set approaches

\[ \frac{1}{\zeta(d)\,\zeta(d-1) \cdots \zeta(d-m+1)}, \]

where $\zeta(a)$ is the Riemann zeta function $\sum_{i=1}^{\infty} \frac{1}{i^a}$.

When m = 1, this lemma gives the probability that a d-tuple of integers are relatively
prime as $1/\zeta(d)$. For m = 1, d = 2, this is a classic result in number theory (see [?]), and for
m = 1, d > 2, this was proven in [?]. Note also that, if m = d and we choose S of size m,
then the probability that S is primitive (i.e., that it is a basis for $\mathbb{Z}^d$) approaches zero. This
agrees with the lemma in the sense that we would expect the probability to be

\[ \frac{1}{\zeta(d)\,\zeta(d-1) \cdots \zeta(1)}, \]

but $\zeta(1)$ does not converge.
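
The case m = 1, d = 2 is easy to check numerically: the fraction of pairs in $\{1, \ldots, N\}^2$ with greatest common divisor 1 should be close to $1/\zeta(2) = 6/\pi^2 \approx 0.6079$. The sketch below counts exactly instead of sampling, and takes the box to start at 1 for simplicity (an assumption; the lemma allows shifted boxes). The function name is illustrative.

```python
from functools import reduce
from itertools import product
from math import gcd, pi

def coprime_fraction(N, d=2):
    """Exact fraction of d-tuples in {1, ..., N}^d whose entries
    have greatest common divisor 1."""
    hits = sum(1 for t in product(range(1, N + 1), repeat=d)
               if reduce(gcd, t) == 1)
    return hits / N ** d

limit = 6 / pi ** 2   # 1 / zeta(2)
```

The enumeration costs $N^d$ steps, so it is only meant for small N and d.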
4.2. Proofs of Lemmas. 

Proof of Lemma 10. Given d and n, define a length n HMM with parameters $\theta_1, \ldots, \theta_d$, as
follows. The observed states will be S and C (for "start of block," and "continuing block,"
respectively). The hidden states will be $s_i$, $s'_i$, $c_i$, and $c'_i$, for $1 \le i \le d + 1$ (think of $s_i$ and
$s'_i$ as "start of the ith block" and $c_i$ and $c'_i$ as "continuing the ith block").

Here is the idea of what we want this HMM to do: if the observed sequence has S's in positions
$1$, $a_1 + 1$, $a_1 + a_2 + 1$, ..., and $a_1 + \cdots + a_d + 1$ and C's elsewhere, then there will be only
two possibilities for the sequence of hidden states, either

\[ t = s_1 \underbrace{c_1 \cdots c_1}_{a_1 - 1} \, s_2 \underbrace{c_2 \cdots c_2}_{a_2 - 1} \cdots s_d \underbrace{c_d \cdots c_d}_{a_d - 1} \, s_{d+1} \underbrace{c_{d+1} \cdots c_{d+1}}_{n - a_1 - \cdots - a_d - 1} \]

or

\[ t' = s'_1 \underbrace{c'_1 \cdots c'_1}_{a_1 - 1} \, s'_2 \underbrace{c'_2 \cdots c'_2}_{a_2 - 1} \cdots s'_d \underbrace{c'_d \cdots c'_d}_{a_d - 1} \, s'_{d+1} \underbrace{c'_{d+1} \cdots c'_{d+1}}_{n - a_1 - \cdots - a_d - 1}. \]

We will also make sure that t has a priori probability $\theta_1^{a_1} \cdots \theta_d^{a_d}$ and t' has a priori probability
1. Then t is the explanation if $a_1 \log(\theta_1) + \cdots + a_d \log(\theta_d) > 0$ and t' is the explanation if
$a_1 \log(\theta_1) + \cdots + a_d \log(\theta_d) < 0$. Remember that we are not constraining our probability
sums to be 1. A very similar HMM could be constructed that obeys that constraint, if
desired. To simplify notation it will be more convenient to treat the transition probabilities
as parameters that do not necessarily sum to one at each vertex, even if this forces us to use
the term "probability" somewhat loosely.

Here is how we set up the transitions/emissions. Let $s_i$ and $s'_i$, for $1 \le i \le d + 1$, all emit S
with probability 1 and C with probability 0. Let $c_i$ and $c'_i$ emit C with probability 1 and S
with probability 0. Let $s_i$, for $1 \le i \le d$, transition to $c_i$ with probability $\theta_i$ and transition
to everything else with probability 0. Let $s_{d+1}$ transition to $c_{d+1}$ with probability 1 and to
everything else with probability 0. Let $s'_i$, for $1 \le i \le d + 1$, transition to $c'_i$ with probability
1 and to everything else with probability 0. Let $c_i$, for $1 \le i \le d$, transition to $c_i$ with
probability $\theta_i$, to $s_{i+1}$ with probability $\theta_i$, and to everything else with probability 0. Let $c_{d+1}$
transition to $c_{d+1}$ with probability 1, and to everything else with probability 0. Let $c'_i$, for
$1 \le i \le d$, transition to $c'_i$ with probability 1, to $s'_{i+1}$ with probability 1, and to everything
else with probability 0. Let $c'_{d+1}$ transition to $c'_{d+1}$ with probability 1 and to everything else
with probability 0.

Starting with the uniform probability distribution on the first hidden state, this does exactly
what we want it to: given the correct observed sequence, $t$ and $t'$ are the only explanations,
with the correct probabilities. □
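
The construction above can be checked on a small instance. The sketch below builds the unnormalized transition weights described in the proof and runs a max-product (Viterbi) search. The names `build_hmm`, `observation` and `viterbi`, the tuple encoding of states, and the sample values of $\theta$ and $a$ are assumptions made for illustration, and the sketch assumes every block length satisfies $a_i \ge 2$ and $n \ge a_1 + \cdots + a_d + 1$.

```python
def build_hmm(theta):
    """Unnormalized block HMM with parameters theta[0..d-1] (hypothetical encoding)."""
    d = len(theta)
    states = [(kind, i) for i in range(1, d + 2) for kind in ("s", "c", "s'", "c'")]
    trans = {}
    for i in range(1, d + 2):
        w = theta[i - 1] if i <= d else 1.0
        trans[("s", i), ("c", i)] = w          # s_i -> c_i
        trans[("c", i), ("c", i)] = w          # c_i -> c_i
        trans[("s'", i), ("c'", i)] = 1.0      # s'_i -> c'_i
        trans[("c'", i), ("c'", i)] = 1.0      # c'_i -> c'_i
        if i <= d:
            trans[("c", i), ("s", i + 1)] = w      # c_i -> s_{i+1}
            trans[("c'", i), ("s'", i + 1)] = 1.0  # c'_i -> s'_{i+1}
    emit = {st: ("S" if st[0] in ("s", "s'") else "C") for st in states}
    return states, trans, emit


def observation(a, n):
    """S at positions 1, a_1+1, ..., a_1+...+a_d+1 (1-indexed), C elsewhere."""
    starts, pos = {0}, 0
    for ai in a:
        pos += ai
        starts.add(pos)
    return ["S" if k in starts else "C" for k in range(n)]


def viterbi(obs, states, trans, emit):
    """Return (weight, path) of the heaviest hidden path emitting obs."""
    best = {st: (1.0, [st]) for st in states if emit[st] == obs[0]}
    for symbol in obs[1:]:
        new = {}
        for st in states:
            if emit[st] != symbol:
                continue
            cands = [(w * trans[(prev, st)], path + [st])
                     for prev, (w, path) in best.items() if (prev, st) in trans]
            if cands:
                new[st] = max(cands, key=lambda c: c[0])
        best = new
    return max(best.values(), key=lambda c: c[0])


states, trans, emit = build_hmm([2.0, 0.25])  # theta_1 = 2, theta_2 = 1/4
for a in ([5, 2], [3, 2]):
    weight, path = viterbi(observation(a, 10), states, trans, emit)
    print(a, weight, path[0])
```

For $a = (5, 2)$ the sum $a_1 \log\theta_1 + a_2 \log\theta_2$ is positive and the unprimed path wins; for $a = (3, 2)$ it is negative and the primed path wins.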

Proof of Lemma. We are going to pick $d - 1$ vectors $a^{(1)}, \ldots, a^{(d-1)}$ which correspond to
the $d - 1$ hyperplanes $\{x : \langle a^{(i)}, x\rangle = 0\}$ that will intersect to give us extreme rays of our
chambers. We will restrict the region from which we pick each $a^{(i)} \in \mathbb{Z}^d$. Let

$$b^{(i)} = (1, 1, \ldots, 1) - \frac{1}{2} e_i$$

for $1 \le i \le d - 1$, where $e_i$ is the $i$th standard basis vector. Let $s = \frac{1}{4d+4}$. For $1 \le i \le d - 1$,
we will choose $a^{(i)} \in \mathbb{Z}^d$ such that

$$\left\| a^{(i)} - \frac{n}{d}\, b^{(i)} \right\|_\infty < \frac{n}{d}\, s. \tag{4}$$

Note that $\sum_{k=1}^{d} a^{(i)}_k < n$, so there are observed sequences which give us the hyperplanes
$\{x : \langle a^{(i)}, x\rangle = 0\}$. Note also that there are $\left(\frac{2s}{d}\right)^{d(d-1)} n^{d(d-1)}$ choices for the $(d-1)$-tuple of vectors
$(a^{(1)}, \ldots, a^{(d-1)})$. To prove this lemma, we must then show that a positive fraction of these
actually give rise to distinct extreme rays $\bigcap_{i=1}^{d-1} \{x : \langle a^{(i)}, x\rangle = 0\}$.

First, we imagine choosing the $a^{(i)}$ uniformly at random in the range given by (4); this
probability distribution meets the condition in the statement of Lemma 13 as $n$ approaches
infinity. Therefore, there is a positive probability that

$$\{a^{(i)} : 1 \le i \le d - 1\} \text{ form a basis for the lattice } \mathbb{Z}^d \cap \operatorname{span}\{a^{(i)} : 1 \le i \le d - 1\}, \tag{5}$$

and this probability approaches

$$\frac{1}{\zeta(d)\,\zeta(d-1)\cdots\zeta(2)}.$$

Second, we look at all choices of $a^{(i)} \in \mathbb{Z}^d$ such that (4) and (5) hold. There are $c_2 n^{d(d-1)}$
of these, for some constant $c_2$. We claim that these give distinct extreme rays
$\bigcap_{i=1}^{d-1} \{x : \langle a^{(i)}, x\rangle = 0\}$. Indeed, say that $a^{(i)}$ and $c^{(i)}$ are both chosen such that (4) and (5) hold and
such that

$$\bigcap_{i=1}^{d-1} \{x : \langle a^{(i)}, x\rangle = 0\} = \bigcap_{i=1}^{d-1} \{x : \langle c^{(i)}, x\rangle = 0\}.$$

We will argue that $a^{(i)}$ and $c^{(i)}$ are "so close" that they must actually be the same.

Let $j$, for $1 \le j \le d - 1$, be given. We will prove that $a^{(j)} = c^{(j)}$. Since

$$\bigcap_{i=1}^{d-1} \{x : \langle a^{(i)}, x\rangle = 0\} \subset \{x : \langle c^{(j)}, x\rangle = 0\},$$

we know that $c^{(j)}$ is in $\operatorname{span}\{a^{(i)} : 1 \le i \le d - 1\}$, and therefore

$$c^{(j)} \in \mathbb{Z}^d \cap \operatorname{span}\{a^{(i)} : 1 \le i < d\}.$$

Let $g = c^{(j)} - a^{(j)}$. Then

$$\|g\|_\infty < 2\,\frac{n}{d}\, s,$$

by Condition (4) for $a^{(j)}$ and $c^{(j)}$, and

$$g = \alpha_1 a^{(1)} + \cdots + \alpha_{d-1} a^{(d-1)},$$

for some $\alpha_i \in \mathbb{Z}$, by Condition (5) for the $a^{(i)}$. We must show that $g = 0$. By reordering
indices and possibly considering $-g$, we may assume that $\alpha_1, \ldots, \alpha_k \ge 0$, for some $k$,
$\alpha_{k+1}, \ldots, \alpha_{d-1} \le 0$, and $|\alpha_1|$ is maximal over all $|\alpha_i|$, $1 \le i \le d - 1$.

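Examining the first coordinate of $g$, we have that

$$\begin{aligned}
-2\,\frac{n}{d}\,s < g_1 &= \alpha_1 a^{(1)}_1 + \cdots + \alpha_{d-1} a^{(d-1)}_1 \\
&\le \alpha_1 \frac{n}{d}\bigl(b^{(1)}_1 + s\bigr) + \cdots + \alpha_k \frac{n}{d}\bigl(b^{(k)}_1 + s\bigr) + \alpha_{k+1} \frac{n}{d}\bigl(b^{(k+1)}_1 - s\bigr) + \cdots + \alpha_{d-1} \frac{n}{d}\bigl(b^{(d-1)}_1 - s\bigr) \\
&= \frac{n}{d}\Bigl[\alpha_1 + \cdots + \alpha_{d-1} - \tfrac{1}{2}\alpha_1 + s\bigl(|\alpha_1| + \cdots + |\alpha_{d-1}|\bigr)\Bigr] \quad \text{(using } b^{(i)} = (1, \ldots, 1) - \tfrac{1}{2} e_i\text{)} \\
&\le \frac{n}{d}\Bigl[\alpha_1 + \cdots + \alpha_{d-1} - \tfrac{1}{2}\alpha_1 + (d-1)\,s\,\alpha_1\Bigr].
\end{aligned}$$

Negating and dividing by $\frac{n}{d}$,

$$-(\alpha_1 + \cdots + \alpha_{d-1}) + \tfrac{1}{2}\alpha_1 - (d-1)\,s\,\alpha_1 < 2s. \tag{6}$$

Similarly, examining the $(k+1)$-st coordinate of $g$, we have

$$\begin{aligned}
2\,\frac{n}{d}\,s > g_{k+1} &= \alpha_1 a^{(1)}_{k+1} + \cdots + \alpha_{d-1} a^{(d-1)}_{k+1} \\
&\ge \alpha_1 \frac{n}{d}\bigl(b^{(1)}_{k+1} - s\bigr) + \cdots + \alpha_k \frac{n}{d}\bigl(b^{(k)}_{k+1} - s\bigr) + \alpha_{k+1} \frac{n}{d}\bigl(b^{(k+1)}_{k+1} + s\bigr) + \cdots + \alpha_{d-1} \frac{n}{d}\bigl(b^{(d-1)}_{k+1} + s\bigr) \\
&= \frac{n}{d}\Bigl[\alpha_1 + \cdots + \alpha_{d-1} - \tfrac{1}{2}\alpha_{k+1} - s\bigl(|\alpha_1| + \cdots + |\alpha_{d-1}|\bigr)\Bigr] \\
&\ge \frac{n}{d}\Bigl[\alpha_1 + \cdots + \alpha_{d-1} - \tfrac{1}{2}\alpha_{k+1} - (d-1)\,s\,\alpha_1\Bigr],
\end{aligned}$$

and so

$$(\alpha_1 + \cdots + \alpha_{d-1}) - \tfrac{1}{2}\alpha_{k+1} - (d-1)\,s\,\alpha_1 < 2s. \tag{7}$$

Adding the equations (6) and (7),

$$\tfrac{1}{2}\alpha_1 - \tfrac{1}{2}\alpha_{k+1} - 2(d-1)\,s\,\alpha_1 < 4s,$$

and so, since $s = \frac{1}{4d+4}$,

$$\frac{1}{d+1}\,\alpha_1 - \frac{1}{2}\,\alpha_{k+1} < \frac{1}{d+1}.$$

Therefore, since $\alpha_{k+1} \le 0$, we have that $\alpha_1 < 1$ and so $\alpha_1 = 0$. Since $|\alpha_1|$ was maximal over
all $|\alpha_i|$, we have that $g = 0$. Therefore $a^{(j)} = c^{(j)}$, and the lemma follows. □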
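
The last arithmetic step can be verified mechanically. The sketch below assumes $s = \frac{1}{4d+4}$ (the value that the constants in the final inequality point to) and checks, with exact rational arithmetic, that adding (6) and (7) leaves the coefficient $\frac{1}{2} - 2(d-1)s = \frac{1}{d+1}$ on $\alpha_1$ and the bound $4s = \frac{1}{d+1}$. The function name `check_constants` is hypothetical.

```python
from fractions import Fraction


def check_constants(d):
    """Return (coefficient of alpha_1, right-hand side) after adding (6) and (7)."""
    s = Fraction(1, 4 * d + 4)                 # assumed value of s
    coeff = Fraction(1, 2) - 2 * (d - 1) * s   # from (1/2) alpha_1 - 2(d-1) s alpha_1
    rhs = 4 * s                                # from 2s + 2s
    return coeff, rhs


for d in range(2, 7):
    coeff, rhs = check_constants(d)
    print(d, coeff, rhs)
```

Both values equal $\frac{1}{d+1}$ for every $d$, which is what forces $\alpha_1 < 1$ once $\alpha_{k+1} \le 0$.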

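Proof of Lemma MU. Suppose $N > 2^{d(d-1)}$, and suppose $a^{(i,j)}$, for $1 \le i \le N$ and $1 \le j \le d - 1$,
are such that $a^{(i,j)} \in \mathbb{Z}^d_{\ge 0}$, $\sum_{k=1}^{d} a^{(i,j)}_k < n$, and the $N$ rays

$$r^{(i)} = \bigcap_{j=1}^{d-1} \{x : \langle a^{(i,j)}, x\rangle = 0\}$$

are the extreme rays for some chamber. Then, since $N > 2^{d(d-1)}$, there are some $i \ne i'$
such that

$$a^{(i,j)}_k \equiv a^{(i',j)}_k \pmod 2,$$

for $1 \le j \le d - 1$ and $1 \le k \le d$ (i.e., all of the coordinates in all of the vectors have the
same parity). Then let

$$c^{(j)} = \frac{a^{(i,j)} + a^{(i',j)}}{2}$$

for $1 \le j \le d - 1$. Then $c^{(j)} \in \mathbb{Z}^d_{\ge 0}$ and $\sum_{k=1}^{d} c^{(j)}_k < n$, and the ray

$$r = \bigcap_{j=1}^{d-1} \{x : \langle c^{(j)}, x\rangle = 0\}$$

is in the chamber, which is a contradiction. □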



5. Inference functions for sequence alignment 



In this section we give an application of Theorem ^ to a basic model for sequence align- 
ment. Sequence alignment is one of the most frequently used techniques in determining the 
similarity between biological sequences. In the standard instance of the sequence alignment 
problem, we are given two sequences (usually DNA or protein sequences) that have evolved 
from a common ancestor via a series of mutations, insertions and deletions. The goal is to 
find the best alignment between the two sequences. The definition of "best" here depends 
on the choice of scoring scheme, and there is often disagreement about the correct choice. 
In parametric sequence alignment, this problem is circumvented by instead computing the 
optimal alignment as a function of variable scores. Here we consider one such scheme, in 
which all matches are equally rewarded, all mismatches are equally penalized and all spaces 
are equally penalized. Efficient parametric sequence alignment algorithms are known (see for 
example Chapter 7]). Here we are concerned with the different inference functions that
can arise when the parameters vary. A detailed treatment of sequence alignment is beyond
the scope of this section.

Given two strings $\sigma^1$ and $\sigma^2$ of lengths $n_1$ and $n_2$ respectively, an alignment is a pair of equal
length strings $(\mu^1, \mu^2)$ obtained from $\sigma^1$, $\sigma^2$ by inserting dashes "-" in such a way that there
is no position in which both $\mu^1$ and $\mu^2$ have a dash. A match is a position where $\mu^1$ and $\mu^2$
have the same character, a mismatch is a position where $\mu^1$ and $\mu^2$ have different characters,
and a space is a position in which one of $\mu^1$ and $\mu^2$ has a dash. A simple scoring scheme
consists of two parameters $\alpha$ and $\beta$ denoting mismatch and space penalties respectively. The
reward of a match is set to 1. The score of an alignment with $z$ matches, $x$ mismatches, and
$y$ spaces is then $z - x\alpha - y\beta$. Observe that these numbers always satisfy $2z + 2x + y = n_1 + n_2$.
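
The definitions above translate directly into code. The following sketch counts matches, mismatches and spaces of a given alignment and evaluates its score; the function names `alignment_counts` and `score`, the use of "-" as the dash, and the sample strings are illustrative assumptions.

```python
def alignment_counts(mu1, mu2):
    """Return (matches z, mismatches x, spaces y) of an alignment (mu1, mu2)."""
    assert len(mu1) == len(mu2), "aligned strings must have equal length"
    z = x = y = 0
    for p, q in zip(mu1, mu2):
        assert not (p == "-" and q == "-"), "no column may contain two dashes"
        if p == "-" or q == "-":
            y += 1
        elif p == q:
            z += 1
        else:
            x += 1
    return z, x, y


def score(mu1, mu2, alpha, beta):
    """Score z - x*alpha - y*beta with match reward fixed to 1."""
    z, x, y = alignment_counts(mu1, mu2)
    return z - x * alpha - y * beta


z, x, y = alignment_counts("AC-GT", "ACCG-")
print(z, x, y, 2 * z + 2 * x + y)  # the last value equals n_1 + n_2
print(score("AC-GT", "ACCG-", alpha=1, beta=0.5))
```

In this sample both original strings have length 4, so the identity $2z + 2x + y = n_1 + n_2$ evaluates to 8.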

This model for sequence alignment can be translated into a probabilistic model, and is a
particular case of a so-called pair hidden Markov model. The problem of determining the
highest scoring alignment for given values of $\alpha$ and $\beta$ is equivalent to the inference problem
in the pair hidden Markov model, with some parameters set to functions of $\alpha$ and $\beta$, or to 0
or 1. In this setting, an observation is a pair of sequences $\tau = (\sigma^1, \sigma^2)$, and the number of
observed variables is $n = n_1 + n_2$. An explanation is then an optimal alignment, since the
values of the hidden variables indicate the positions of the spaces.
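
For fixed $\alpha$ and $\beta$, a highest scoring alignment can be found by the standard dynamic program over prefixes (a Needleman-Wunsch style recursion). The sketch below is one possible implementation under the scoring scheme of this section, not the parametric algorithm referred to in the text; the function name `best_alignment` and the tie-breaking order in the traceback are assumptions.

```python
def best_alignment(s1, s2, alpha, beta):
    """Return (score, mu1, mu2) for match = 1, mismatch = -alpha, space = -beta."""
    n1, n2 = len(s1), len(s2)
    best = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):          # prefix of s1 against the empty string
        best[i][0] = best[i - 1][0] - beta
    for j in range(1, n2 + 1):          # empty string against prefix of s2
        best[0][j] = best[0][j - 1] - beta

    def sub(i, j):
        return 1.0 if s1[i - 1] == s2[j - 1] else -alpha

    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best[i][j] = max(best[i - 1][j - 1] + sub(i, j),
                             best[i - 1][j] - beta,
                             best[i][j - 1] - beta)

    # Trace back one optimal alignment (ties broken in a fixed order).
    mu1, mu2 = [], []
    i, j = n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + sub(i, j):
            mu1.append(s1[i - 1]); mu2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] - beta):
            mu1.append(s1[i - 1]); mu2.append("-"); i -= 1
        else:
            mu1.append("-"); mu2.append(s2[j - 1]); j -= 1
    return best[n1][n2], "".join(reversed(mu1)), "".join(reversed(mu2))


print(best_alignment("01", "10", alpha=3, beta=0.5))
```

With a large mismatch penalty the optimum prefers two spaces and one match over two mismatches; lowering $\alpha$ far enough flips that choice, which is the kind of change in the explanation that distinguishes inference functions.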

In the rest of this section we will refer to this as the 2-parameter model for sequence alignment.
Note that it actually comes from a 3-parameter model where the reward for a match has, 
without loss of generality, been set to 1. The Newton polytopes of the coordinates of the 
model are defined in a 3-dimensional space, but in fact they lie on a plane, as we will see 
next. Thus, the parameter space has only two degrees of freedom. 

For each pair of sequences $\tau$, the Newton polytope of the polynomial $f_\tau$ is the convex hull
of the points $(x, y, z)$ whose coordinates are the number of mismatches, spaces, and matches,
respectively, of each possible alignment of the pair. This polytope lies on the plane $2z + 2x +
y = n_1 + n_2$, so no information is lost by considering its projection onto the $xy$-plane instead.
This projection is just the convex hull of the points $(x, y)$ giving the number of mismatches
and spaces of each alignment. For any alignment of sequences of lengths $n_1$ and $n_2$, the
corresponding point $(x, y)$ lies inside the square $[0, n]^2$, where $n = n_1 + n_2$. Therefore, since
we are dealing with lattice polygons inside $[0, n]^2$, it follows from Theorem ^ that the number
of inference functions of this model is $O(n^2)$. Next we show that this quadratic bound is
tight, even in the case of the binary alphabet.
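
To make the points $(x, y, z)$ concrete, the sketch below collects the triples (mismatches, spaces, matches) over all alignments of two short strings by a dynamic program on prefixes, and confirms that they lie on the plane $2z + 2x + y = n_1 + n_2$. The function name `alignment_triples` and the sample strings are illustrative assumptions.

```python
def alignment_triples(s1, s2):
    """Set of (mismatches, spaces, matches) over all alignments of s1 and s2."""
    n1, n2 = len(s1), len(s2)
    cell = [[set() for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    cell[0][0].add((0, 0, 0))
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            if i and j:  # align s1[i-1] with s2[j-1]
                hit = 1 if s1[i - 1] == s2[j - 1] else 0
                cell[i][j] |= {(x + 1 - hit, y, z + hit)
                               for x, y, z in cell[i - 1][j - 1]}
            if i:        # s1[i-1] against a dash
                cell[i][j] |= {(x, y + 1, z) for x, y, z in cell[i - 1][j]}
            if j:        # dash against s2[j-1]
                cell[i][j] |= {(x, y + 1, z) for x, y, z in cell[i][j - 1]}
    return cell[n1][n2]


triples = alignment_triples("01", "10")
print(sorted(triples))
print(all(2 * z + 2 * x + y == 4 for x, y, z in triples))
```

Dropping the last coordinate of each triple gives the lattice points whose convex hull is the projected polygon in $[0, n]^2$.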

Proposition 14. Consider the 2-parameter model for sequence alignment for two observed
sequences of length $n$ and let $\Sigma' = \{0, 1\}$ be the binary alphabet. Then, the number of inference
functions of this model is $\Theta(n^2)$.

Proof. The above argument shows that $cn^2$ is an upper bound on the number of inference
functions of the model, for some constant $c$. To prove the proposition, we will argue that
there is some constant $c'$ such that there are at least $c'n^2$ such functions.

Since the two sequences have the same length, the number of spaces in any alignment is even.
For convenience, we define $y' = y/2$ and $\beta' = 2\beta$, and we will work with the coordinates
$(x, y', z)$ and the parameters $\alpha$ and $\beta'$. The value $y'$ is called the number of insertions (half
the number of spaces), and $\beta'$ is the insertion penalty. For fixed values of $\alpha$ and $\beta'$, the
explanation of an observation $\tau = (\sigma^1, \sigma^2)$ is given by the vertex of $\mathrm{NP}(f_\tau)$ that is maximal
in the direction of the vector $(-\alpha, -\beta', 1)$. In this model, $\mathrm{NP}(f_\tau)$ is the convex hull of the
points $(x, y', z)$ whose coordinates are the number of mismatches, insertions and matches of
the alignments of $\sigma^1$ and $\sigma^2$.

The argument in the proof of Theorem ^ shows that the number of inference functions of this
model is the number of cones in the common refinement of the normal fans of $\mathrm{NP}(f_\tau)$, where
$\tau$ runs over all pairs of sequences of length $n$ in the alphabet $\Sigma'$. Since the polytopes $\mathrm{NP}(f_\tau)$
lie on the plane $x + y' + z = n$, it is equivalent to consider the normal fans of their projections
onto the $y'z$-plane. These projections are lattice polygons contained in the square $[0, n]^2$. We
denote by $P_\tau$ the projection of $\mathrm{NP}(f_\tau)$ onto the $y'z$-plane.

We will construct a collection of pairs of binary sequences $\tau = (\sigma^1, \sigma^2)$ so that the total
number of different slopes of the edges of the polygons $\mathrm{NP}(f_\tau)$ is $\Omega(n^2)$. This will imply
that the number of cones in $\bigwedge_\tau \mathcal{N}(\mathrm{NP}(f_\tau))$ is $\Omega(n^2)$, where $\tau$ ranges over all pairs of binary
sequences of length $n$.

We claim that for any positive integers $u$ and $v$ with $u < v$ and $6v - 2u \le n$, there exists a
pair $\tau$ of binary sequences of length $n$ such that $P_\tau$ has an edge of slope $u/v$. This will imply
that the number of different slopes created by the edges of the polygons $P_\tau$ is $\Omega(n^2)$.

Thus, it only remains to prove the claim. Given positive integers $u$ and $v$ as above, let
$a := 2v$, $b := v - u$. Assume first that $n = 6v - 2u = 2a + 2b$. Consider the sequences

$$\sigma^1 = 0^a 1^b 0^b 1^a, \qquad \sigma^2 = 1^a 0^b 1^b 0^a,$$

where $0^a$ indicates that the symbol 0 is repeated $a$ times. Let $\tau = (\sigma^1, \sigma^2)$. Then, it is
not hard to see that the polygon $P_\tau$ for this pair of sequences has four vertices: $v_0 = (0, 0)$,
$v_1 = (b, 3b)$, $v_2 = (a + b, a + b)$ and $v_3 = (n, 0)$. The slope of the edge between $v_1$ and $v_2$ is
$(a - 2b)/a = u/v$.

If $n > 6v - 2u = 2a + 2b$, we just append $0^{n-2a-2b}$ to both sequences $\sigma^1$ and $\sigma^2$. In this
case, the vertices of $P_\tau$ are $(0, n - 2a - 2b)$, $(b, n - 2a + b)$, $(a + b, n - a - b)$, $(n, 0)$ and
$(n - 2a - 2b, 0)$.




Figure 7. A pair of binary sequences of length 18 giving the slope 3/7 in
their alignment polytope.

Note that if $v - u$ is even, the construction can be done with sequences of length $n = 3v - u$
by taking $a := v$, $b := (v - u)/2$. Figure 7 shows the alignment graph and the polygon $P_\tau$ for
$a = 7$, $b = 2$. □
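
The construction in this proof can be checked by brute force on a small case. The sketch below reads the sequences as $\sigma^1 = 0^a 1^b 0^b 1^a$ and $\sigma^2 = 1^a 0^b 1^b 0^a$ (an assumed reading of the display), takes $u = 1$, $v = 2$, so $a = 4$, $b = 1$, $n = 10$, collects all (insertions, matches) points, and computes their convex hull with Andrew's monotone chain. The function names are hypothetical.

```python
from fractions import Fraction


def insertion_match_points(s1, s2):
    """All (insertions y', matches z) over alignments of two equal-length strings."""
    n = len(s1)
    assert len(s2) == n
    cell = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    cell[0][0].add((0, 0))  # entries are (spaces, matches)
    for i in range(n + 1):
        for j in range(n + 1):
            if i and j:
                hit = 1 if s1[i - 1] == s2[j - 1] else 0
                cell[i][j] |= {(y, z + hit) for y, z in cell[i - 1][j - 1]}
            if i:
                cell[i][j] |= {(y + 1, z) for y, z in cell[i - 1][j]}
            if j:
                cell[i][j] |= {(y + 1, z) for y, z in cell[i][j - 1]}
    return {(y // 2, z) for y, z in cell[n][n]}  # spaces come in pairs here


def convex_hull(points):
    """Vertices of the convex hull (monotone chain, collinear points dropped)."""
    pts = sorted(points)
    if len(pts) <= 2:
        return pts

    def cross(o, p, q):
        return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


u, v = 1, 2
a, b = 2 * v, v - u
sigma1 = "0" * a + "1" * b + "0" * b + "1" * a
sigma2 = "1" * a + "0" * b + "1" * b + "0" * a
hull = convex_hull(insertion_match_points(sigma1, sigma2))
print(hull)
print(Fraction((a + b) - 3 * b, (a + b) - b))  # slope between v_1 and v_2
```

For this instance the hull vertices are $(0,0)$, $(b, 3b) = (1,3)$, $(a+b, a+b) = (5,5)$ and $(n, 0) = (10, 0)$, and the slope of the edge from $v_1$ to $v_2$ is $1/2 = u/v$.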

In most cases, one is interested only in those inference functions that are biologically
meaningful. In our case, meaningful values of the parameters occur when $\alpha, \beta \ge 0$, which means
that mismatches and spaces are penalized instead of rewarded. Sometimes one also requires
that $\alpha \le 2\beta$, which means that a mismatch should be penalized less than two spaces. It is
interesting to observe that our construction in the proof of Proposition 14 not only shows
that the total number of inference functions is $\Theta(n^2)$, but also that the number of biologically
meaningful ones is still $\Omega(n^2)$. This is because the different rays created in our construction
have a biologically meaningful direction in the parameter space.

6. Final remarks 

An interpretation of Theorem ^ is that the ability to change the values of the parameters of 
a graphical model does not give as much freedom as it may appear. There is a very large 
number of possible ways to assign an explanation to each observation. However, only a tiny 
proportion of these come from a consistent method for choosing the most probable explanation 
for a certain choice of parameters. Even though the parameters can vary continuously, the 
number of different inference functions that can be obtained is at most polynomial in the 
number of edges of the model, assuming that the number of parameters is fixed. 

In the case of sequence alignment, the number of possible functions that associate an
alignment to each pair of sequences of length $n$ is doubly-exponential in $n$. However, the number
of functions that pick the alignment with highest score in the 2-parameter model, for some
choice of the parameters $\alpha$ and $\beta$, is only $O(n^2)$. Thus, most ways of assigning alignments
to pairs of sequences do not correspond to any consistent choice of parameters. If we use a
model with more parameters, say $d$, the number of inference functions may be larger, but
still polynomial in $n$, namely $O(n^{d(d-1)})$.
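
The gap between the two counts can be made tangible for small $n$. The sketch below assumes a binary alphabet and uses the standard fact that alignments of two sequences of lengths $n_1$ and $n_2$ correspond to lattice paths with steps $(1,0)$, $(0,1)$, $(1,1)$, so they are counted by Delannoy numbers. It then reports $\log_2$ of the number of arbitrary assignments of alignments to pairs of sequences, next to $n^2$. The function names are hypothetical.

```python
from math import log2


def num_alignments(n1, n2):
    """Number of alignments of sequences of lengths n1 and n2 (Delannoy number)."""
    D = [[1] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            D[i][j] = D[i - 1][j] + D[i][j - 1] + D[i - 1][j - 1]
    return D[n1][n2]


def log2_num_assignments(n, alphabet_size=2):
    """log2 of the number of maps from pairs of length-n sequences to alignments."""
    pairs = alphabet_size ** (2 * n)  # number of observations
    return pairs * log2(num_alignments(n, n))


for n in range(1, 6):
    print(n, num_alignments(n, n), round(log2_num_assignments(n), 1), n ** 2)
```

Even the logarithm of the number of arbitrary assignments grows exponentially in $n$, while the number of inference functions in the 2-parameter model grows only quadratically.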

Having shown that the number of inference functions of a graphical model is polynomial in the 
size of the model, an interesting next step would be to find an efficient way to precompute 
all the inference functions for given models. This would allow us to give the answer (the 
explanation) to a query (an observation) very quickly. It follows from this paper that it
is computationally feasible to precompute the polytope NP(f), whose vertices correspond to 
the inference functions. However, the difficulty arises when we try to describe a particular 
inference function efficiently. The problem is that the characterization of an inference function 
involves an exponential number of observations. 